Key Points 


The COVID-19 pandemic provides an opportunity to replace current educational assessments 
with new tests that provide a net increase in the overall utility of the assessment system 
to stakeholders on the ground, including educators and families, while reducing
stakeholders’ burden.


In the short term, federal policymakers should assist states with replacing high-burden, 
low-value summative assessments used to meet Every Student Succeeds Act accountability 
requirements with higher-value interim assessments administered throughout the
school year.


In the long term, the US Department of Education should use its resources to encourage 
development of a low-burden, high-value assessment system in which student performance 
data are collected as part of routine interactions with a digital learning platform. 


Standardized educational assessments are often 
criticized as overly burdensome, competing for 
vital instructional time and narrowing the curricu- 
lum to what is tested. They are regarded as expen- 
sive, diverting scarce resources from students, 
teachers, and classrooms to shadowy testing com- 
panies. They are unfair, showing stubborn achieve- 
ment gaps between rich and poor, Black and White, 
and suburban and rural students year after year. 
Yet, as Lindsay Fryer’s report in this series high- 
lights, despite all these criticisms, federal policy- 
makers on both sides of the aisle have continually 
returned to standardized assessment as a key com- 
ponent of school improvement, civil rights, and 
accountability. The reason is simple: Despite these 
assessments’ many real and imagined flaws, they 
remain the best way of measuring student achievement
with reliability, fairness, and validity. Indeed,
it could be said that standardized assessment is the 
worst way of understanding students’ knowledge 
and capabilities—except for all the other ways that 
have been tried. 

This is not to say that educational assessment 
cannot be improved. The COVID-19 pandemic has 
profoundly affected education and has exposed 
significant weaknesses in assessment that have 
been ignored or tolerated for too long. Faced with 
an enormous public health crisis and an unex- 
pected shift to emergency remote learning, all 50 
states, the District of Columbia, Puerto Rico, and 
the Bureau of Indian Education sought and were
granted assessment waivers from the US Department
of Education (ED) for the 2019-20 school
year. 

At the same time, other educational assessment 
programs either faced chaos and cancellations, as 
the SAT and ACT did,2 or launched untested and
unvalidated alternative virtual models, as the
Advanced Placement (AP) and International Baccalaureate
programs did.3 The fortunes of testing
companies with interim or formative assessment 
products depended on whether they could offer 
remote testing. 

There are many possible explanations for why 
educational assessment in the United States was 
caught flat-footed by the pandemic, including a 
fractured marketplace of testing providers, weak 
federal oversight, underinvestment in broadband 
and computer-based assessment technology, and 
insufficient incentives for research and innovation 
in alternative security and remote delivery models. 
Undoubtedly, many postmortem analyses will be 
written in the years to come. However, like all cri- 
ses, this disruption to testing-as-usual provides an 
opportunity to move beyond finger-pointing and 
rethink the form and function of educational assess- 
ment. 

As educators confront the challenge of remedi- 
ating COVID-19 learning loss—and policymakers 
wrestle with measuring it amid widespread discon- 
tent with testing—now is an apt time to consider 
how much of the current accountability and assess- 
ment regime is truly necessary. In this report, I 
examine the state of testing post-March 2020 and 
explore how a minimally viable, less burdensome 
assessment system might look. 


Classifying Assessments 


Before examining how the future of educational 
assessment could look, it is useful first to clarify 
some concepts and introduce some terminology. 
Assessment programs are often described by their 
purpose or primary use. Summative assessments 
are intended to measure what students have 
learned or can do, in contrast to formative assess- 
ments, which are generally more diagnostic and 
integrated with classroom activities and are designed 
to assist educators in guiding instruction. 
Between these lie interim assessments, such as
NWEA’s MAP Growth and Renaissance Learning’s 
Star Assessments.4 These are often similar in form
and operation to summative tests but are given 
more often throughout the period of instruction, 
and they are sometimes used to determine 
whether students are on track to meet summative 
benchmarks. Historically, federal law, regulations, 
and rules around assessment—such as the No Child 
Left Behind (NCLB) Act and its successor, the 
Every Student Succeeds Act (ESSA),5 along with 
their associated guidance—have focused on require- 
ments for standardized summative testing at the 
state level. 

States and school districts generally cannot sat- 
isfy all ESSA requirements, let alone meet all their 
educational measurement objectives, through the 
purchase of a single assessment program, product, 
or service. Instead, it is useful to think of the com- 
bination of various tests as an assessment sys- 
tem—one in which individual components may be 
provided by different vendors and designed for dif- 
ferent purposes. As shown below, some past and 
potential future innovations in assessment occur 
at this systemic level, such as the replacement of 
one assessment program with another that meets 
multiple requirements simultaneously. 


Figure 1. Value vs. Burden of Assessments 
Source: Author.


Beyond the typical classification of assessment 
programs as summative, interim, or formative, we 
can consider the value they provide to students, 
families, educators, and policymakers and the burden
they place on these stakeholders. The value of
an assessment program, from the stakeholder’s
point of view, might be determined by what infor- 
mation it provides and the timeliness of those data, 
whether it has additional benefits (such as college 
credit for scores above a certain level), and what 
legal, bureaucratic, or regulatory requirements it 
satisfies. The burden of a given assessment pro- 
gram is generally a function of how much testing 
time it requires and how much money it costs the 
stakeholders. Figure 1 presents a simple typology 
of assessment programs by these two dimensions. 

The placement of various assessment programs 
into these categories depends a great deal on the 
stakeholder’s point of view. For example, the AP 
program, from the perspective of most families 
and students, is a type B (high burden, high value) 
assessment. Although AP is relatively burdensome, 
requiring much in student instructional time, 
study and preparation, and exam fees (unless 
waived), it returns a lot of potential value—college 
credit and a strong signal of achievement and pre- 
paredness. 

The National Assessment of Educational Progress
(NAEP),6 on the other hand, can be considered
a type C (low burden, low value) assessment
for students and families. It does not require much 
of the average student (who likely will not even be 
sampled), nor does it return much to them of use. 
In fact, NAEP by design cannot provide scores for 
individual students and thus is useful only for 
aggregate reporting.

For the typical student and family—and per- 
haps most teachers and street-level administra- 
tors—the state summative assessments used to 
meet ESSA requirements are type D (high burden, 
low value) tests. They require significant assess- 
ment time every spring for the third through eighth 
grades and at least once in high school (in the case 
of mathematics and English language arts), they 
cost tens of millions of dollars annually, and many 
stakeholders believe they shape instruction in 
ways that may narrow the curriculum (although 
this is an open research question with mixed find- 
ings).7 In exchange for this high burden, the tests 
give little to parents and students; the results come 
at the end of the school year or even over the sum- 
mer and often do not provide much in the way of 
actionable diagnostic data to help guide future 
instruction. 
For education leaders and policymakers, however,
the positioning of various assessment programs
in this typology may be different. For example,
from the perspective of a district superintendent
participating in NAEP’s Trial Urban District
Assessment (TUDA) program, NAEP may be a 
type B test—reasonably high burden in adminis- 
trative requirements for participation but also rel- 
atively high value, since each TUDA administration 
allows that district’s performance to be compared 
to the nation’s, states’, and other districts’ results. 

Yet, even for this audience of local policymak- 
ers, in most cases the state summative assess- 
ments used to satisfy ESSA requirements are still 
type D: high burden, low value. They are perceived 
as useful for satisfying federal reporting require- 
ments in support of civil rights monitoring and 
accountability objectives but not much else, at the 
expense of a lot of time and money that could be 
better used on instruction or more diagnostic assess- 
ment. 
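
The typology can also be written out as a small lookup, which makes explicit that the same program can land in different cells for different stakeholders. The following is only an illustrative sketch: it assumes value and burden can each be reduced to a simple high/low judgment (the report treats both qualitatively), and the names `AssessmentType`, `classify`, and `judgments` are hypothetical labels introduced here, not part of any assessment product or federal rule. The placements are the ones given in the examples above.

```python
from enum import Enum


class AssessmentType(Enum):
    """The four cells of the value-versus-burden typology."""
    A = "high value, low burden"
    B = "high value, high burden"
    C = "low value, low burden"
    D = "low value, high burden"


def classify(high_value: bool, high_burden: bool) -> AssessmentType:
    """Map one stakeholder's high/low judgments to a type (assumed binary)."""
    if high_value:
        return AssessmentType.B if high_burden else AssessmentType.A
    return AssessmentType.D if high_burden else AssessmentType.C


# (program, stakeholder) -> (high_value, high_burden), following the text's examples.
judgments = {
    ("AP", "students and families"): (True, True),
    ("NAEP", "students and families"): (False, False),
    ("NAEP", "TUDA district superintendent"): (True, True),
    ("State summative (ESSA)", "students and families"): (False, True),
    ("State summative (ESSA)", "local policymakers"): (False, True),
}

for (program, stakeholder), (value, burden) in judgments.items():
    kind = classify(value, burden)
    print(f"{program} / {stakeholder}: type {kind.name} ({kind.value})")
```

Keying the judgments by both program and stakeholder, rather than by program alone, reflects the point that placement depends on the stakeholder's point of view.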


Testing in the COVID-19 Pandemic 


Although the pandemic has created enormous 
challenges for education in general and for assess- 
ment in particular, not all assessment programs 
have fared equally. Regarding the typology above, 
stakeholders have generally pushed to preserve or 
even expand type B testing, ignore type C assess- 
ments, and jettison their type D tests. Arguably, 
type A assessment programs do not exist in the 
marketplace or in the portfolios of states today— 
more on this below. 


Type B Assessments Fared Well. Business has 
been good for test vendors with computer-based 
formative and diagnostic assessments in English 
language arts or mathematics, as local education 
agencies scramble to diagnose COVID-19 learning 
loss and tailor instruction this fall accordingly. 
Perhaps more strikingly, despite concerns about 
fairness and an untested design-and-delivery platform,8
the College Board had an outpouring of
public support to find a technical solution that 
would enable it to offer AP exams after schools 
shut down in March 2020—and a willingness by 
postsecondary institutions to accept the scores. 
The lesson here is that some assessments are so
useful that stakeholders will take on additional risk
or tolerate the bending of the usual principles of 
psychometrics and measurement to preserve 
them. 


Type C Assessments Were Ignored. Even in the 
best of times, low-value, low-burden assessment 
programs are not salient to most stakeholders, at 
least when compared to other types of tests. Dis- 
ruptions due to the pandemic have made it harder 
to collect data even from low-burden assessments, 
such as those used for research and statistical pur- 
poses and given to relatively small samples of stu- 
dents. The ED postponed NAEP math and reading 
assessments until 2022, for example, citing con- 
cerns about the pandemic. 


Type D Assessments Were Avoided. On the 
other hand, as noted above, every state and juris- 
diction sought a waiver to suspend state summative 
assessment in the spring of 2020, and many sought 
waivers for 2021.9 This is even though, in most 
cases, these assessments are low stakes for stu- 
dents, are already administered via cloud-based 
digital platforms, and could be transitioned to 
home administration with some ingenuity and 
resources. The lesson here is that state and district 
leaders have little appetite to find innovative solu- 
tions to operational challenges if there is any pos- 
sibility of a waiver of state summative assessment, 
given the value-burden trade-off. 


Toward a Minimum Viable Assessment 
System 


Assuming there is still consensus at the federal 
level around the fundamental civil rights objective 
of state summative educational assessment as 
enshrined in NCLB and ESSA—ensuring that 
states continue to produce assessment data that 
can be used for subgroup accountability report- 
ing—then the COVID-19 pandemic provides an 
opportunity to replace the current assessments with 
new tests that provide a net increase in the overall 
utility of the assessment system to stakeholders on 
the ground, including educators and families, 
while reducing burden. In fact, “opportunity” may be 
an understatement; many stakeholders will likely
refuse to return to the status quo ante COVID-19. 

In other words, the primary goal of educational 
assessment policy in the near term must be to 
drive the replacement of type D tests used to meet 
federal requirements with type B ones (Figure 2). 
How might this look? One idea being discussed in 
the assessment field is replacing the typical end-of- 
year state summative assessment in mathematics, 
English language arts, and science with measure- 
ments derived from a series of interim assessments 
given throughout the year.10


Figure 2. Increase the Value of Tests 
Used to Meet ESSA Requirements 
Source: Author.


Such a redesigned system might indeed still be 
burdensome in cost and time. (Although, if a state 
or its districts are already using interim assessments,11
the net effect will be a reduction in burden.)
But if executed correctly, it could return
more timely and actionable data to guide instruc- 
tion and support improving student achievement. 

Thus, in the typology presented here, this new 
program would be a type B assessment for educa- 
tors and leaders and possibly for parents, with 
some additional design and outreach, since an 
interim assessment system could provide enhanced 
information on their child’s academic achievement 
multiple times throughout the year. Also, while
there are important implementation details to
work out (interim assessment products generally
cannot be used off the shelf as they currently
operate but must be made suitable for accountability
purposes), this shift does not require
significant investment in new technology or research
and development. Digital platforms and
item pools (collections of test questions) exist that 
are sufficient to create a minimum viable version 
with some modest investment, although some 
have questioned the item quality.12

How could the federal government help? A tran- 
sition from status quo state summative assess- 
ments to an interim-as-summative approach, in 
which interim tests are rolled up into summative 
results for reporting, is already permissible under 
the ESSA. But states may need technical assistance 
in making the transition and developing a com- 
prehensive assessment system using this ap- 
proach. 

Given that there are technical and design issues 
to solve, such as test security, timing of admin- 
istrations, provision of accommodations, and the 
combination of interim data into a summative 
score, states may also need additional resources, 
which will almost surely be passed through to test- 
ing vendors. Perhaps these resources can be 
awarded via a competitive grant program with pro- 
visions for technology and knowledge sharing to 
the field more broadly. 


Looking Long Term 


A transition to rolled-up, interim-as-summative testing,
which would increase the utility of assessments
for stakeholders and possibly reduce burden
across the entire assessment system, is a worth- 
while first step. But in the longer term, educational 
assessment needs to move beyond simply shuffling 
the burden and increasing value and develop truly 
type A—high value, low burden—tests (Figure 3). 
How might such an assessment program look? 
One idea is a shift to a more embedded assessment 
model, in which student performance data are col- 
lected as part of routine interactions with a digital 
learning platform. These measurement opportuni- 
ties could include engaging, formative, computer- 
based enhanced performance tasks (and the devel- 
opment of the accommodations and accessibility 
technology required to deliver them for all stu- 
dents) that provide significantly more diagnostic 
assessment data while remaining valid, reliable, 
and fair. These (currently few) initiatives'4 all need 
significantly more research and development, 
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beyond the type-D-to-B shift discussed above and 
resource-intensive prototyping and pilot testing. 

Because of this need for significant additional 
research, development, and testing and because of 
the uncertain and risky return to providers on the 
large investment of development resources 
needed, there is likely a second federal role in sup- 
porting the infrastructure, technology, and science 
needed to develop this next generation of type A 
assessment programs. 


Figure 3. Fund Research to Reduce 
Burden 
Source: Author.


This federal support could be structured by 
splitting it into two separate programs: one housed 
in ED’s Office of Innovation and Improvement 
and the other in the Institute of Education Sci- 
ences. 

The Office of Innovation and Improvement pro- 
gram might be an assessment-specific innovation 
accelerator and venture partnership designed 
along the lines of the US Air Force’s AFWERX15
program—composed of several initiatives, includ- 
ing innovation hubs, challenge programs, and the 
Spark initiative, to drive grassroots innovation— 
and a venture arm based on more-traditional 
Small Business Innovation Research and Small 
Business Technology Transfer programs that 
fund small business research and development 
but relax some constraints to allow for adopting 
already-commercialized technologies.16

The Institute of Education Sciences program 
should build on that agency’s experience with scientific
grant making and direct supervision of contracted
research and assessment vendors. In particular,
there are likely synergies with the NAEP
program, which is considering a radical transfor- 
mation from an outdated digital platform to a 
next-generation assessment design. By wisely 
using federal NAEP contracting dollars to not 
only improve that program but also create technol- 
ogy (not assessment content) that could be 
shared and disseminated for use in state assess- 
ment by vendors, research centers, and state 
agencies, the Institute of Education Sciences 
could be the perfect laboratory for research and 
development and has the capacity to manage this 
activity. 


Conclusion 


State summative assessment is at a crossroads. 
Dissatisfaction with current assessments coupled 
with an unprecedented disruption in all aspects of 
education has created significant pressure for an
end to these assessments. However, despite its 
flaws and challenges, valid, fair, and reliable stand- 
ardized testing remains an important component 
of consensus goals in education policy at the fed- 
eral level. 

Indeed, those who ignore assessment will be 
condemned to reinvent it. Stakeholders (the federal
government, states, districts, and the assessment
industry) should make the necessary investments
to fix testing now, as measurement of what
students know, and don’t know, is more crucial 
than ever. 